Move or upload all .fast5 output files to your own home sub-directory. My directories are /hb/home/jatoy/bfren_genome/minion/runs/cell_*_fast5. (separate directories for cell_12 and cell_17)
Create two more directories in ~/bfren_genome/minion/runs/ called cell_12_fastq/ and cell_17_fastq/. We will direct the output files from guppy to these directories.
Run guppy (v5.0.15) to convert .fast5 files to .fastq files (base calling). This can be done automatically during the sequencing run, but the run will take longer, and if the computer crashes during that time, you may lose data. Recording the data in the .fast5 file format saves time and is safer. The guppy software then does the base calling and makes a new .fastq file for each .fast5 file. This way you can also run newer basecalling models on the same sequencing data as more accurate models are released.
Create the following slurm script file and save it as “guppy_v5_0_15.slurm”
#!/bin/bash
#SBATCH --job-name=guppy_2021-10-01
#SBATCH --output=guppy_2021-10-01.out
#SBATCH --error=guppy_02021-10-01.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jatoy@ucsc.edu
#SBATCH --partition=96x24gpu4
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --ntasks-per-node=24
# Above are all the specifics SLURM needs to run your specific job.
# --ntasks-per-node will be used in doParallel.R to specify the number
# of cores to use on the machine. Using 24
# now load programs needed
module load guppy
#run the command:
guppy_basecaller -c dna_r9.4.1_450bps_sup.cfg --num_callers 24 --device cuda:0 --input_path /hb/home/jatoy/bfren_genome/minion/runs/cell_12_fast5 --save_path /hb/home/jatoy/bfren_genome/minion/runs/cell_12_fastq
# --device option says, "Use the first GPU"
# dna_r9.4.1_450bps_sup.cfg is the "super accurate" bonito model for r9 flow cells
Submit the job:
sbatch guppy_v5_0_15.slurm
When the job has completed, you should have a .fastq corresponding to each .fast5 file.
Now repeat with run 17 data: Edit or create a new slurm script file and save it as “guppy_v5_0_15_run17.slurm”
#!/bin/bash
#SBATCH --job-name=guppy_2021-10-01_run17
#SBATCH --output=guppy_2021-10-01_run17.out
#SBATCH --error=guppy_02021-10-01_run17.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jatoy@ucsc.edu
#SBATCH --partition=96x24gpu4
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --ntasks-per-node=24
# Above are all the specifics SLURM needs to run your specific job.
# --ntasks-per-node will be used in doParallel.R to specify the number
# of cores to use on the machine. Using 24
# now load programs needed
module load guppy
#run the command:
guppy_basecaller -c dna_r9.4.1_450bps_sup.cfg --num_callers 24 --device cuda:0 --input_path /hb/groups/bernardi_lab/MinIon/BFR_17 --save_path /hb/home/jatoy/bfren_genome/minion/runs/cell_17_fastq
# --device option says, "Use the first GPU"
# dna_r9.4.1_450bps_sup.cfg is the "super accurate" bonito model for r9 flow cells
Now concatenate all the .fastq files (from both sequencing runs) into a single file called bfren_ref_bigfile.fastq:
cd ~/minion/runs
cat cell_12_fastq/pass/*.fastq cell_17_fastq/pass/*.fastq > bfren_ref_bigfile.fastq
Use NanoStat (v1.5.0) to summarize the sequencing run and calculate stats. I did this in an interactive session:
#request a node
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive
#ssh into your allocated node
ssh $SLURM_NODELIST
#run commands
module load miniconda3.9 #if you don't run this first, NanoStat won't load properly
conda activate /hb/groups/bernardi_lab/programs/nanostat_v1.5.0
cd ~/bren_genome/minion/runs/
NanoStat --fastq bfren_ref_bigfile.fastq --threads 24 --name bfren_ref_bigfile.fastq_statfastq_report
bfren_ref_bigfile.fastq_statfastq_report:
General summary:
Mean read length: 2,979.6
Mean read quality: 14.4
Median read length: 2,072.0
Median read quality: 14.5
Number of reads: 5,608,949.0
Read length N50: 4,850.0
STDEV read length: 2,955.3
Total bases: 16,712,372,110.0
Number, percentage and megabases of reads above quality cutoffs
>Q5: 5608949 (100.0%) 16712.4Mb
>Q7: 5608949 (100.0%) 16712.4Mb
>Q10: 5608667 (100.0%) 16712.0Mb
>Q12: 4764274 (84.9%) 14351.0Mb
>Q15: 2330888 (41.6%) 7254.2Mb
Top 5 highest mean basecall quality scores and their read lengths
1: 37.7 (2036)
2: 35.0 (662)
3: 32.1 (274)
4: 28.6 (675)
5: 28.2 (54)
Top 5 longest reads and their mean basecall quality score
1: 87499 (14.5)
2: 83825 (11.0)
3: 80049 (15.0)
4: 77894 (14.8)
5: 77796 (13.9)
Compare with results from just run 12, using guppy v3: bfren_r_bigfile.fastq_statfastq_report:
General summary:
Mean read length: 2,197.2
Mean read quality: 10.8
Median read length: 998.0
Median read quality: 11.3
Number of reads: 2,080,084.0
Read length N50: 4,654.0
Total bases: 4,570,356,427.0
Number, percentage and megabases of reads above quality cutoffs
>Q5: 2011943 (96.7%) 4464.6Mb
>Q7: 1903219 (91.5%) 4267.0Mb
>Q10: 1504444 (72.3%) 3438.1Mb
>Q12: 700977 (33.7%) 1631.8Mb
>Q15: 8653 (0.4%) 7.1Mb
Top 5 highest mean basecall quality scores and their read lengths
1: 20.9 (431)
2: 19.8 (1878)
3: 19.6 (288)
4: 19.3 (985)
5: 19.2 (270)
Top 5 longest reads and their mean basecall quality score
1: 86361 (12.0)
2: 84389 (8.2)
3: 82997 (9.6)
4: 79414 (12.0)
5: 77315 (11.8)
Check for fastq errors First make sure your concatenated fastq files have correctly merged using Fastq Utils (v0.25.1).
#request a node
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive
#ssh into your allocated node
ssh $SLURM_NODELIST
#load miniconda
module load miniconda3.9
#activate fastq_utils
conda activate /hb/groups/bernardi_lab/programs/fastq_utils_v0.25.1
#change working directory
cd ~/bfren_genome/minion/runs
#run command
fastq_info bfren_ref_bigfile.fastq.gz
fastq_info output:
fastq_utils 0.25.1
DEFAULT_HASHSIZE=39000001
Scanning and indexing all reads from bfren_ref_bigfile.fastq.gz
Read name provided with no suffix
5600000Scanning complete.
Reads processed: 5608949
Memory used in indexing: ~1025 MB
------------------------------------
Number of reads: 5608949
Quality encoding range: 34 123
Quality encoding: sanger
Read length: 27 87499 2072
OK
Run FastQC (v0.11.7) on bigfile to get more detailed QC info on the reads:
module load fastqc
module load java #may not be necessary anymore, but if you get a java error, try loading this module
fastqc bfren_ref_bigfile.fastq.gz -t 24 -o ~/bfren_genome/minion/runs/fastqc/
If all looks good, move on to next step to remove NanoPore adapters
First, compress the “bigfile” fasta using gzip (the next step requires a zipped file as input).
gzip -v bfren_ref_bigfile.fastq
Then, run PoreChop (v2.4) to remove the NanoPore adaptor sequences from your reads.
#access node
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive
ssh $SLURM_NODELIST
#load miniconda module
module load miniconda3.9
#activate PoreChop
conda activate /hb/groups/bernardi_lab/programs/porechop_v2.4
#change to working directory
cd ~/bfren_genome/minion/runs/
#run PoreChop
porechop -i bfren_ref_bigfile.fastq.gz -o bfren_ref_bigfile_pc2.fastq.gz --threads 24
Porechop output:
Trimming adapters from read ends
SQK-NSK007_Y_Top: AATGTACTTCGTTCAGTTACGTATTGCT
SQK-NSK007_Y_Bottom: GCAATACGTAACTGAACGAAGT
1D2_part_2_start: CTTCGTTCAGTTACGTATTGCTGGCGTCTGCTT
1D2_part_2_end: CACCCAAGCAGACGCCAGCAATACGTAACT
5,608,949 / 5,608,949 (100.0%)
5,300,176 / 5,608,949 reads had adapters trimmed from their start (185,081,259 bp removed)
3,482,449 / 5,608,949 reads had adapters trimmed from their end (55,258,957 bp removed)
Splitting reads containing middle adapters
5,608,949 / 5,608,949 (100.0%)
31,973 / 5,608,949 reads were split based on middle adapters
Saving trimmed reads to file
pigz not found - using gzip to compress
Saved result to /hb/home/jatoy/bfren_genome/minion/runs/bfren_ref_bigfile_pc.fastq.gz
Run fastq_info again on porechopped reads to check for errors in new fastq file:
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/fastq_utils_v0.25.1
cd ~/bfren_genome/minion/runs
fastq_info bfren_ref_bigfile_pc.fastq
fastq_info output:
fastq_utils 0.25.1
DEFAULT_HASHSIZE=39000001
Scanning and indexing all reads from bfren_ref_bigfile_pc.fastq
Read name provided with no suffix
5600000Scanning complete.
Reads processed: 5614527
Memory used in indexing: ~1026 MB
------------------------------------
Number of reads: 5614527
Quality encoding range: 34 123
Quality encoding: sanger
Read length: 13 87449 2029
OK
Rerun NanoStat on porechopped reads to summarize the sequencing run and calculate stats.
First unzip bfren_ref_bigfile_pc.fastq.gz
gunzip bfren_ref_bigfile_pc.fastq.gz
Then run NanoStat again:
#request a node
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive
#ssh into your allocated node
ssh $SLURM_NODELIST
#run commands
module load miniconda3.9 #if you don't run this first, NanoStat won't load properly
conda activate /hb/groups/bernardi_lab/programs/nanostat_v1.5.0
cd ~/bren_genome/minion/runs/
NanoStat --fastq bfren_ref_bigfile_pc.fastq --threads 24 --name bfren_ref_bigfile_pc.fastq_statfastq_report
NanoStat Results for bfren_ref_bigfile_pc.fastq:
General summary:
Mean read length: 2,932.8
Mean read quality: 14.6
Median read length: 2,029.0
Median read quality: 14.7
Number of reads: 5,612,978.0
Read length N50: 4,830.0
STDEV read length: 2,947.6
Total bases: 16,461,999,922.0
Number, percentage and megabases of reads above quality cutoffs
>Q5: 5612977 (100.0%) 16462.0Mb
>Q7: 5612965 (100.0%) 16462.0Mb
>Q10: 5610436 (100.0%) 16459.2Mb
>Q12: 4821562 (85.9%) 14193.0Mb
>Q15: 2529345 (45.1%) 7398.0Mb
Top 5 highest mean basecall quality scores and their read lengths
1: 90.0 (169)
2: 90.0 (499)
3: 90.0 (169)
4: 89.5 (233)
5: 51.2 (261)
Top 5 longest reads and their mean basecall quality score
1: 87449 (14.5)
2: 83790 (11.0)
3: 80001 (15.0)
4: 77865 (14.8)
5: 77769 (13.9)
Run NanoPlot on the reads to look at read size/quality distributions
#request a node
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive
ssh $SLURM_NODELIST
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/nanoplot
NanoPlot -t 24 --fastq bfren_ref_bigfile_pc.fastq --plots hex dot --outdir nanoplot
Run FastQC again after runnging PoreChop and compare the results:
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive
ssh $SLURM_NODELIST
module load fastqc
fastqc bfren_ref_bigfile_pc.fastq --outdir fastqc
Now create two filtered sets of reads from the “bigfile”: one for longer reads of lower quality and one for shorter reads of higher quality.
#access node
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive
ssh $SLURM_NODELIST
#load miniconda and activate NanoFilt
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/nanofilt
#make sure you're in the correct working directory
cd ~/minion/runs/
#run NanoFilt
gunzip -c bfren_ref_bigfile_pc.fastq.gz | NanoFilt -l 1000 | gzip > bfren_r_bigfile_pc_l000.fastq.gz
gunzip -c bfren_ref_bigfile_pc.fastq.gz | NanoFilt -l 500 | gzip > bfren_r_bigfile_pc_500.fastq.gz
#or if the bigfile_pc.fastq file is already unzipped
NanoFilt bfren_ref_bigfile_pc.fastq -l 1000 | gzip > bfren_ref_bigfile_pc_l000.fastq.gz
NanoFilt bfren_ref_bigfile_pc.fastq -l 500 | gzip > bfren_ref_bigfile_pc_500.fastq.gz
Align long reads into scaffolds using two different approaches: once using only the reads > 1000bp, and once using all reads > 500 bp
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/wtdbg2
wtdbg2 -x ont -g 600m -t 24 -L 1000 -i bfren_ref_bigfile_pc_l000.fastq.gz -fo bfr_ont_1000 #reads > 1000bp
wtdbg2 -x ont -g 600m -t 24 -L 500 -i bfren_ref_bigfile_pc_500.fastq.gz -fo bfr_ont_500 #reads >500bp
Now make a consensus sequence of the scaffolds.
wtpoa-cns -t 24 -i bfr_ont_1000.ctg.lay.gz -fo bfr_ont_1000.ctg.fa
wtpoa-cns -t 24 -i bfr_ont_500.ctg.lay.gz -fo bfr_ont_500.ctg.fa
Get stats on your new assemblies and compare them.
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/assembly_stats
assembly-stats bfr_ont_1000.ctg.fa > bfr_ont_1000.ctg.fa_assemblystats
assembly-stats bfr_ont_500.ctg.fa > bfr_ont_500.ctg.fa_assemblystats
The assembly that used all reads >500bp was more contiguous (1,008 scaffolds) with a greater average scaffold length (588,787.14 bp) than the assembly that used only reads >1000bp (1,033 scaffolds, average length 575,385.33), so from here on, we will use the more inclusive subset (reads >500 bp) and it’s corresponding assembly.
Run NanoStat for the final filtered dataset:
#request a node
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive
#ssh into your allocated node
ssh $SLURM_NODELIST
#run commands
module load miniconda3.9 #if you don't run this first, NanoStat won't load properly
conda activate /hb/groups/bernardi_lab/programs/nanostat_v1.5.0
cd ~/bren_genome/minion/runs/
NanoStat --fastq bfren_ref_bigfile_pc_500.fastq --threads 24 --name bfren_ref_bigfile_pc_500.fastq_statfastq_report
NanoStat Results for bfren_ref_bigfile_pc_500.fastq:
General summary:
Mean read length: 3,268.7
Mean read quality: 14.6
Median read length: 2,418.0
Median read quality: 14.7
Number of reads: 4,966,516.0
Read length N50: 4,886.0
STDEV read length: 2,973.0
Total bases: 16,234,043,863.0
Number, percentage and megabases of reads above quality cutoffs
>Q5: 4966515 (100.0%) 16234.0Mb
>Q7: 4966503 (100.0%) 16234.0Mb
>Q10: 4965319 (100.0%) 16231.7Mb
>Q12: 4276732 (86.1%) 14000.6Mb
>Q15: 2250354 (45.3%) 7299.6Mb
Top 5 highest mean basecall quality scores and their read lengths
1: 37.7 (2036)
2: 35.0 (662)
3: 28.6 (675)
4: 27.7 (506)
5: 27.4 (651)
Top 5 longest reads and their mean basecall quality score
1: 87449 (14.5)
2: 83790 (11.0)
3: 80001 (15.0)
4: 77865 (14.8)
5: 77769 (13.9)
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/busco_v5.2.2
busco -i ~/bfren_genome/minion/runs/redbean/minlen_500/bfr_ont_500.ctg.fa -o busco -l actinopterygii_odb10 -m genome --cpu 24
BUSCO results
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk
***** Results: *****
C:95.3%[S:94.4%,D:0.9%],F:1.9%,M:2.8%,n:3640
3471 Complete BUSCOs (C)
3437 Complete and single-copy BUSCOs (S)
34 Complete and duplicated BUSCOs (D)
69 Fragmented BUSCOs (F)
100 Missing BUSCOs (M)
3640 Total BUSCO groups searched
Dependencies and versions:
hmmsearch: 3.1
metaeuk: 5.34c21f2
To polish, we will first align the same MinIon reads we just used for scaffolding (all reads >500 bp) to the scaffold assembly we just created using minimap2 (v2.17-r941).
#access node
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive
ssh $SLURM_NODELIST
#load miniconda module and activate minimap2
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/minimap2
#make sure you're in the correct working directory
cd ~/minion/runs/
#run minimap2
minimap2 -ax map-ont -t 24 redbean/minlen_500/bfr_ont_500.ctg.fa bfren_ref_bigfile_pc_500.fastq.gz > bfr_ont_500_mmap.sam
# -a instructs minimap2 to output in .sam format
# -x map-ont sets the tuning parameters to presets optimized for ONT reads
Now that the reads are mapped to the draft assembly, we will build a new consensus assembly using Racon (v1.4.13).
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/racon
#run racon
racon -u -m 5 -x -4 -g -8 -w 500 -t 24 bfren_ref_bigfile_pc_500.fastq.gz bfr_ont_500_mmap.sam redbean/minlen_500/bfr_ont_500.ctg.fa > bfr_ont_500_racon.fa
# usage = racon [options ...] <mapped sequences> <alignment .sam file> <target sequences>
# -u = same as --include-unpolished; output unpolished target (reference) sequences
# -m = same as --match <int>; default is 3; score for matching bases
# -x = same as --mismatch <int>; default is -5; score for mismatching bases
# -g = same as --gap <int>, default: -4, gap penalty (must be negative)
# -w = same as --window-length <int>; default is 500; size of window on which POA is performed
# -t = number of threads
Run assembly-stats on the now polished genome.
module load miniconda3
conda activate /hb/groups/bernardi_lab/programs/assembly_stats
#run assembly-stats
assembly-stats bfr_ont_500_racon.fa > bfr_ont_500_racon.fa_assemblystats
Note: enter these stats and the original assembly stats into a spreadsheet and add a new entry after each round of polishing to keep track of how each round influences the quality of the genome.
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/busco_v5.2.2
busco -i ~/bfren_genome/minion/runs/bfr_ont_500_racon.fa -o busco_racon -l actinopterygii_odb10 -m genome --cpu 24
BUSCO results
# BUSCO version is: 5.2.2
# The lineage dataset is: actinopterygii_odb10 (Creation date: 2021-02-19, number of genomes: 26, number of BUSCOs: 3640)
# Summarized benchmarking in BUSCO notation for file /hb/home/jatoy/bfren_genome/minion/runs/bfr_ont_500_racon.fa
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk
***** Results: *****
C:97.4%[S:96.7%,D:0.7%],F:1.0%,M:1.6%,n:3640
3547 Complete BUSCOs (C)
3520 Complete and single-copy BUSCOs (S)
27 Complete and duplicated BUSCOs (D)
37 Fragmented BUSCOs (F)
56 Missing BUSCOs (M)
3640 Total BUSCO groups searched
Dependencies and versions:
hmmsearch: 3.1
metaeuk: 5.34c21f2
We will now polish the once-polished assembly again with the same nanopore data as last time.
First map the nanopore reads to the once-polished assembly (racon output) using minimap2:
#access node
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive
ssh $SLURM_NODELIST
#load miniconda module and activate minimap2
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/minimap2
#make sure you're in the correct working directory
cd ~/minion/runs/
#run minimap2
minimap2 -ax map-ont -t 24 bfr_ont_500_racon.fa bfren_ref_bigfile_pc_500.fastq.gz > bfr_ont_500_mmap_2.sam
# -a instructs minimap2 to output in .sam format
# -x map-ont sets the tuning parameters to presets optimized for ONT reads
Now make the consensus sequences (Racon)
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/racon
#run racon
racon -u -m 5 -x -4 -g -8 -w 500 -t 24 bfren_ref_bigfile_pc_500.fastq.gz bfr_ont_500_mmap_2.sam bfr_ont_500_racon.fa > bfr_ont_500_racon_2.fa
# usage = racon [options ...] <mapped sequences> <alignment .sam file> <target sequences>
# -u = same as --include-unpolished; output unpolished target (reference) sequences
# -m = same as --match <int>; default is 3; score for matching bases
# -x = same as --mismatch <int>; default is -5; score for mismatching bases
# -g = same as --gap <int>, default: -4, gap penalty (must be negative)
# -w = same as --window-length <int>; default is 500; size of window on which POA is performed
# -t = number of threads
[racon::Polisher::initialize] loaded target sequences 2.900655 s [racon::Polisher::initialize] loaded sequences 320.725576 s [racon::Polisher::initialize] loaded overlaps 87.950252 s [racon::Polisher::initialize] aligning overlaps [====================] 91.493828 s [racon::Polisher::initialize] transformed data into windows 52.285708 s [racon::Window::generate_consensus] warning: contig 398 might be chimeric in window 2! [racon::Polisher::polish] generating consensus [====================] 909.857116 s [racon::Polisher::] total = 1480.237076 s
Run assembly-stats again:
module load miniconda3
conda activate /hb/groups/bernardi_lab/programs/assembly_stats
#run assembly-stats
assembly-stats bfr_ont_500_racon_2.fa > bfr_ont_500_racon_2.fa_assemblystats
Run BUSCO again:
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/busco_v5.2.2
busco -i ~/bfren_genome/minion/runs/bfr_ont_500_racon_2.fa -o busco_racon_2 -l actinopterygii_odb10 -m genome --cpu 24
BUSCO results:
# BUSCO version is: 5.2.2
# The lineage dataset is: actinopterygii_odb10 (Creation date: 2021-02-19, number of genomes: 26, number of BUSCOs: 3640)
# Summarized benchmarking in BUSCO notation for file /hb/home/jatoy/bfren_genome/minion/runs/bfr_ont_500_racon_2.fa
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk
***** Results: *****
C:97.7%[S:97.0%,D:0.7%],F:0.9%,M:1.4%,n:3640
3555 Complete BUSCOs (C)
3529 Complete and single-copy BUSCOs (S)
26 Complete and duplicated BUSCOs (D)
32 Fragmented BUSCOs (F)
53 Missing BUSCOs (M)
3640 Total BUSCO groups searched
Dependencies and versions:
hmmsearch: 3.1
metaeuk: 5.34c21f2
First combine all read 1 files into one file and all read 2 files into another file:
cd /hb/home/jatoy/bfren_genome/illumina_reads
cat BFR_ref_CSFP210002260-1a_H3MTYDSX2_L1_1.fq BFR_ref_CSFP210002260-1a_H55VJDSX2_L2_1.fq > BFR_ref_sr_1.fq
cat BFR_ref_CSFP210002260-1a_H3MTYDSX2_L1_2.fq BFR_ref_CSFP210002260-1a_H55VJDSX2_L2_2.fq > BFR_ref_sr_2.fq
Check read quality and assess level of adapter contamination for the combined files using FastQC (v0.11.7):
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive
ssh $SLURM_NODELIST
module load fastqc
cd /hb/home/jatoy/bfren_genome/illumina_reads
fastqc BFR_ref_sr_1.fq --outdir fastqc
fastqc BFR_ref_sr_2.fq --outdir fastqc
These actually looked really clean, so I think Novogene trimmed them already.
module load trimmomatic
java -jar /hb/software/apps/trimmomatic/gnu-0.39/trimmomatic-0.39.jar PE -threads 24 -phred33 -summary summary_stats BFR_ref_sr_1.fq BFR_ref_sr_2.fq BFR_ref_sr_1_trim_paired.fq.gz BFR_ref_sr_1_trim_unpaired.fq.gz BFR_ref_sr_2_trim_paired.fq.gz BFR_ref_sr_2_trim_unpaired.fq.gz ILLUMINACLIP:/hb/software/apps/trimmomatic/gnu-0.39/adapters/TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:2 TRAILING:2 MINLEN:25
Options:
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 Remove adapters
LEADING:2 Remove leading low quality or N bases (below quality 3)
TRAILING:2 Remove trailing low quality or N bases (below quality 3)
MINLEN:25 Drop reads below the 36 bases long
SLIDINGWINDOW:4:15 Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15
Trimming summary:
Input Read Pairs: 95669159
Both Surviving Reads: 95215363
Both Surviving Read Percent: 99.53
Forward Only Surviving Reads: 452066
Forward Only Surviving Read Percent: 0.47
Reverse Only Surviving Reads: 0
Reverse Only Surviving Read Percent: 0.00
Dropped Reads: 1730
Dropped Read Percent: 0.00
Check read quality and assess level of adapter contamination for the combined files using FastQC (v0.11.7):
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive
ssh $SLURM_NODELIST
module load fastqc
cd /hb/home/jatoy/bfren_genome/illumina_reads
fastqc BFR_ref_sr_1_trim_paired.fq.gz --outdir fastqc
fastqc BFR_ref_sr_2_trim_paired.fq.gz --outdir fastqc
First, create an index for the current reference (bfr_ont_500_racon_2.fa) using the bwa index command (v0.7.17-r1188):
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/bwa
bwa index ~/bfren_genome/minion/runs/bfr_ont_500_racon_2.fa
bwa index will create new files (.amb, .ann, .bwt, .pac, .sa) that will need to be within the directory to map sequences using bwa.
Now map Illumina short reads to assembly:
bwa mem -t 24 bfr_ont_500_racon_2.fa BFR_ref_sr_1_trim_paired.fq.gz BFR_ref_sr_2_trim_paired.fq.gz > BFR_ref_sr_bwa_aligned.sam
Convert the SAM file to BAM format using samtools (v1.10):
module load samtools
samtools view -Sb -@ 24 -O BAM -o BFR_ref_sr_bwa_aligned.bam BFR_ref_sr_bwa_aligned.sam
# S = input format is auto-detected
# b = output format BAM
# @ = number of threads
# O = specify output format
# o = output file name
Sort BAM file:
samtools sort -o BFR_ref_sr_bwa_aligned_sorted.bam -O BAM -@ 24 BFR_ref_sr_bwa_aligned.bam
Pilon requires the BAM input file to be indexed, so create the index file (.bai):
samtools index BFR_ref_sr_bwa_aligned_sorted.bam
Now run Pilon (v1.23) to polish second Racon assembly with the short read BAM file:
salloc --partition=256x44 --nodes=1 --time=96:00:00 --exclusive
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/pilon
java -Xmx200G -jar /hb/groups/bernardi_lab/programs/pilon/share/pilon-1.23-2/pilon-1.23.jar --genome bfr_ont_500_racon_2.fa --bam BFR_ref_sr_bwa_aligned_sorted.bam --output BFR_ref_pilon --outdir /hb/home/jatoy/bfren_genome/illumina_reads/pilon/polish_1 --diploid --changes --vcf --tracks --threads 10 > /hb/home/jatoy/bfren_genome/illumina_reads/pilon/polish_1/BFR_ref_pilon_out_256.log
I ran Pilon with the –diploid parameter. Remy did not do this, but according to Bruce Walker (the creator of Pilon), “Really, the only thing”–diploid" does is affect whether to report hererozygous SNPs and small indels. Since many of the initial Pilon applications were bacterial, by default it treats mixed evidence as ambiguous rather than as a heterozygous SNP or indel. Pilon’s support for diploid calls isn’t very sophisticated; in particular, it isn’t able to generate multiple haplotypes through local reassembly (it’s only calling heterzygosity via the alignment pileup information)."
Now run assembly-stats on the new Pilon assembly:
conda activate /hb/groups/bernardi_lab/programs/assembly_stats
assembly-stats BFR_ref_pilon.fasta > BFR_ref_pilon.fasta_assemblystats
Run BUSCO on Pilon assembly:
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/busco_v5.2.2
busco -i /hb/home/jatoy/bfren_genome/illumina_reads/pilon/polish_1/BFR_ref_pilon.fasta -o busco_pilon_1 -l actinopterygii_odb10 -m genome --cpu 24
# BWA (Burrows-Wheeler Alignment) First, create an index for the current reference (BFR_ref_pilon.fasta) using the bwa index command (v0.7.17-r1188):
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/bwa
bwa index ~/bfren_genome/illumina_reads/pilon/polish_1/BFR_ref_pilon.fasta
bwa index will create new files (.amb, .ann, .bwt, .pac, .sa) that will need to be within the directory to map sequences using bwa.
Now map Illumina short reads to assembly:
bwa mem -t 24 ~/bfren_genome/illumina_reads/pilon/polish_1/BFR_ref_pilon.fasta ~/bfren_genome/illumina_reads/BFR_ref_sr_1_trim_paired.fq.gz ~/bfren_genome/illumina_reads/BFR_ref_sr_2_trim_paired.fq.gz > BFR_ref_sr_bwa_aligned_2.sam
Convert the SAM file to BAM format using samtools (v1.10):
module load samtools
samtools view -Sb -@ 24 -O BAM -o BFR_ref_sr_bwa_aligned_2.bam BFR_ref_sr_bwa_aligned_2.sam
# S = input format is auto-detected
# b = output format BAM
# @ = number of threads
# O = specify output format
# o = output file name
Sort BAM file:
samtools sort -o BFR_ref_sr_bwa_aligned_sorted_2.bam -O BAM -@ 24 BFR_ref_sr_bwa_aligned_2.bam
Pilon requires the BAM input file to be indexed, so create the index file (.bai):
samtools index BFR_ref_sr_bwa_aligned_sorted_2.bam
Now run Pilon (v1.23) to polish the first Pilon assembly with the new short read BAM file:
salloc --partition=256x44 --nodes=1 --time=96:00:00 --exclusive
ssh $SLURM_NODELIST
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/pilon
java -Xmx200G -jar /hb/groups/bernardi_lab/programs/pilon/share/pilon-1.23-2/pilon-1.23.jar --genome ~/bfren_genome/illumina_reads/pilon/polish_1/BFR_ref_pilon.fasta --bam /hb/home/jatoy/bfren_genome/illumina_reads/bwa/polish_2/BFR_ref_sr_bwa_aligned_sorted_2.bam --output BFR_ref_pilon_2 --outdir /hb/home/jatoy/bfren_genome/illumina_reads/pilon/polish_2 --diploid --changes --vcf --tracks --threads 10 > /hb/home/jatoy/bfren_genome/illumina_reads/pilon/polish_2/BFR_ref_pilon_2_out.log
Now run assembly-stats on the new Pilon assembly:
conda activate /hb/groups/bernardi_lab/programs/assembly_stats
assembly-stats BFR_ref_pilon_2.fasta > BFR_ref_pilon_2.fasta_assemblystats
Run BUSCO on Pilon assembly:
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/busco_v5.2.2
busco -i /hb/home/jatoy/bfren_genome/illumina_reads/pilon/polish_2/BFR_ref_pilon_2.fasta -o busco_pilon_2 -l actinopterygii_odb10 -m genome --cpu 24
salloc --partition=128x24 --nodes=1 --time=96:00:00 --exclusive
ssh $SLURM_NODELIST
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/blast
#set path for blast databases
export PATH=$PATH:/hb/groups/bernardi_lab/programs/blast
export BLASTDB=/hb/groups/bernardi_lab/programs/blobtools2/nt
#run blast
blastn -db nt -query /hb/home/jatoy/bfren_genome/illumina_reads/pilon/polish_2/BFR_ref_pilon_2.fasta -outfmt "6 qseqid staxids bitscore std" -max_target_seqs 10 -max_hsps 1 -evalue 1e-20 -out /hb/home/jatoy/bfren_genome/BFR_ref_pilon_2_blast.out -num_threads 12
# qseqid = Query Seq-id
# staxids = unique Subject Taxonomy ID(s), separated by a ';'(in numerical order)
# std = default format specifiers = 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore'
-max_target_seqs = Number of aligned sequences to keep. Use with report formats that do not have separate definition line and alignment sections such as tabular (all outfmt > 4). Not compatible with num_descriptions or num_alignments. Ties are broken by order of sequences in the database.
IMPORTANT! - When using -max_target_seqs, “BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence. Even ordering the database in a different way would cause BLAST to return a different ‘top hit’ when setting the max_target_seqs parameter to 1.” (Shah et al. 2019 Bioinformatics)
-max_hsps = Maximum number of HSPs (alignments) to keep for any single query-subject pair. The HSPs shown will be the best as judged by expect value. This number should be an integer that is one or greater. If this option is not set, BLAST shows all HSPs meeting the expect value criteria. Setting it to one will show only the best HSP for every query-subject pair.
-outfmt = Output format. The format specified above is the default format expected by blobtools2 as input.
More details on BLAST parameters here: https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.options_common_to_all_blast/
First, create an index for the final reference (BFR_ref_pilon_2.fasta) using the bwa index command (v0.7.17-r1188):
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/bwa
bwa index ~/bfren_genome/illumina_reads/pilon/polish_2/BFR_ref_pilon_2.fasta
bwa index will create new files (.amb, .ann, .bwt, .pac, .sa) that will need to be within the directory to map sequences using bwa.
Now map Illumina short reads to assembly:
bwa mem -t 24 ~/bfren_genome/illumina_reads/pilon/polish_2/BFR_ref_pilon_2.fasta ~/bfren_genome/illumina_reads/BFR_ref_sr_1_trim_paired.fq.gz ~/bfren_genome/illumina_reads/BFR_ref_sr_2_trim_paired.fq.gz > BFR_ref_sr_bwa_aligned_3.sam
Convert the SAM file to BAM format using samtools (v1.10):
module load samtools
samtools view -Sb -@ 24 -O BAM -o BFR_ref_sr_bwa_aligned_3.bam BFR_ref_sr_bwa_aligned_3.sam
# S = input format is auto-detected
# b = output format BAM
# @ = number of threads
# O = specify output format
# o = output file name
Sort BAM file:
samtools sort -o BFR_ref_sr_bwa_aligned_sorted_3.bam -O BAM -@ 24 BFR_ref_sr_bwa_aligned_3.bam
To run Blobtools2 you will need:
Your final assembly fasta file BFR_ref_pilon_2.fasta
Your .bam coverage file for the final assembly (e.g., using bwa to map illumina sequences to your final assembly) BFR_ref_sr_bwa_aligned_sorted_3.bam
Your BLAST output for the final assembly BFR_ref_pilon_2_blast.out
(optional) The BUSCO output full table for your final assembly full_table.tsv
Once you have all the files, running blobtools is relatively easy and fast.
Start an interactive session:
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive --cpus-per-task=8
ssh $SLURM_NODELIST
Now create your blobtools database and populate it with the coverage data, taxonomic information, and BUSCO scores:
#------- load programs --------------
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/blobtools2
#------- Run Commands ---------------
#Assign source directory variable to environment
SRC=/hb/home/jatoy/bfren_genome/
#Create a BlobDir
blobtools add --create --fasta $SRC/illumina_reads/pilon/polish_2/BFR_ref_pilon_2.fasta $SRC/blobtools2/BFR_ref_pilon_2_blob
#Add BLAST hits
blobtools add --threads 8 --hits $SRC/BFR_ref_pilon_2_blast.out --taxrule bestsumorder --taxdump /hb/groups/bernardi_lab/programs/blobtools2/taxdump $SRC/blobtools2/BFR_ref_pilon_2_blob
#Add mapping coverage
blobtools add --threads 8 --cov $SRC/illumina_reads/bwa/final_mapping_for_blobtools/BFR_ref_sr_bwa_aligned_sorted_3.bam $SRC/blobtools2/BFR_ref_pilon_2_blob
#Add BUSCO
blobtools add --threads 8 --busco $SRC/illumina_reads/pilon/busco_pilon_2/run_actinopterygii_odb10/full_table.tsv $SRC/blobtools2/BFR_ref_pilon_2_blob
--taxrule: BlobTools2 assigns a putative taxonomic origin to each scaffold at 8 ranks from superkingdom to species based on aggregating hits in the –hits files. The –taxrule flag determines the rule to use when assigning BLAST hits to taxa. Two options are available: “bestsum” and “bestsumorder”. Of the taxa represented in the BLAST hits, “bestsum” assigns the taxon for which the sum of bitscores is greatest across all files while the default “bestsumorder” taxrule uses the maximum bitscore in the first file and only uses hits from subsequent files for taxa without a hit in previous files.
First download Miniconda installer (from Anaconda website) and install via WSL command line:
bash Miniconda3-latest-Linux-x86_64.sh
Now install blobtools2:
#create directory for blobtookit
mkdir blobtoolkit
cd blobtoolkit
#create conda env with dependencies
conda create -n blobtools2 -y python=3.6 docopt pyyaml ujson pysam tqdm nodejs seqtk
conda activate blobtools2
#download taxdump database
mkdir -p taxdump
cd taxdump
curl -L ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz | tar xzf -
cd ..
#download blobtools2
git clone https://github.com/blobtoolkit/blobtools2
git clone https://github.com/blobtoolkit/specification
git clone https://github.com/blobtoolkit/insdc-pipeline
#download and install blobtoolkit viewer
git clone https://github.com/blobtoolkit/viewer
cd viewer
npm install
cd ..
#add new directories to path
export PATH=~/blobtoolkit/blobtools2:~/blobtoolkit/specification:~/blobtoolkit/insdc-pipeline/scripts:$PATH
#now run pip to install everything
pip install blobtoolkit
Download BFR_ref_pilon_2_blob folder to desktop and run blobtools view.
scp -r jatoy@hb.ucsc.edu:~/bfren_genome/blobtools2/BFR_ref_pilon_2_blob/ ./winhome/OneDrive/Documents/ucsc/projects/landscape_genomics/kelp_perch/reference_genome/blobtools
Create interactive html page with the blobtools results:
#activate conda env
conda activate blobtools2
#change to desired directory
cd ./winhome/OneDrive/Documents/ucsc/projects/landscape_genomics/kelp_perch/reference_genome/blobtools
#run view command
blobtools view --remote BFR_ref_pilon_2_blob
Running blobtools view will output a local url that you can copy and paste into a web browser to view the figures and summary:
Blob plot
Cumulative composition plot
Snail plot
Save the data table from the “table” tab in the viewer as a .csv. Open in excel and sort by “bestsumorder phylum”. This will allow you to see which contigs were identified as contaminant taxa.
On the cluster, make a file with the list of contig numbers to KEEP in the assembly by copying and pasting into a new text file (BFR_ref_pilon_2_ctg_clean.txt).
nano BFR_ref_pilon_2_ctg_clean.txt
head BFR_ref_pilon_2_ctg_clean.txt
ctg1_pilon_pilon
ctg2_pilon_pilon
ctg3_pilon_pilon
ctg4_pilon_pilon
ctg5_pilon_pilon
ctg6_pilon_pilon
ctg7_pilon_pilon
ctg8_pilon_pilon
ctg9_pilon_pilon
ctg10_pilon_pilon
Using samtools v(1.14), we will extract from the assembly only the contigs identified as Chordata or no hit. Contigs identified as Proteobacteria will be excluded.
module load samtools
cd ~/bfren_genome/blobtools2
xargs samtools faidx ../illumina_reads/pilon/polish_2/BFR_ref_pilon_2.fasta < BFR_ref_pilon_2_ctg_clean.txt > BFR_ref_clean.fasta
Check final assembly stats
conda activate /hb/groups/bernardi_lab/programs/assembly_stats
assembly-stats BFR_ref_clean.fasta
stats for BFR_ref_clean.fasta
sum = 595959787, n = 1003, ave = 594177.26, largest = 12326090
N50 = 2589815, n = 55
N60 = 1896895, n = 82
N70 = 1185759, n = 122
N80 = 758445, n = 184
N90 = 406612, n = 292
N100 = 1298, n = 1003
N_count = 0
Gaps = 0
In total, blobtools removed 5 contigs containing 27,511 bp.
salloc --partition=128x24 --nodes=1 --time=96:00:00 --exclusive
ssh $SLURM_NODELIST
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/busco_v5.2.2
busco -i /hb/home/jatoy/bfren_genome/blobtools2/BFR_ref_clean.fasta -o busco_blob_clean -l actinopterygii_odb10 -m genome --cpu 24
C:98.1%[S:97.4%,D:0.7%],F:0.5%,M:1.4%,n:3640
3572 Complete BUSCOs (C)
3547 Complete and single-copy BUSCOs (S)
25 Complete and duplicated BUSCOs (D)
20 Fragmented BUSCOs (F)
48 Missing BUSCOs (M)
3640 Total BUSCO groups searched
BUSCO scores were not changed by the removal of contaminant sequences.
I would now like to order my contigs from largest to smallest and rename contigs to a sequential order.
Right now my fasta file looks like this:
grep ">" BFR_ref_clean.fasta | head
>ctg1_pilon_pilon
>ctg2_pilon_pilon
>ctg3_pilon_pilon
>ctg4_pilon_pilon
>ctg5_pilon_pilon
>ctg6_pilon_pilon
>ctg7_pilon_pilon
>ctg8_pilon_pilon
>ctg9_pilon_pilon
>ctg10_pilon_pilon
grep ">" BFR_ref_clean.fasta | tail
>ctg911_pilon_pilon
>ctg940_pilon_pilon
>ctg928_pilon_pilon
>ctg945_pilon_pilon
>ctg935_pilon_pilon
>ctg957_pilon_pilon
>ctg943_pilon_pilon
>ctg833_pilon_pilon
>ctg963_pilon_pilon
>ctg993_pilon_pilon
We will use seqkit (v2.1.0) to order contigs by length and rename
conda activate /hb/groups/bernardi_lab/programs/seqkit
#order contigs by length
seqkit sort --by-length --reverse BFR_ref_clean.fasta > BFR_ref_clean_ordered.fasta
#rename contigs
seqkit replace --pattern '.+' --replacement 'Contig_{nr}' BFR_ref_clean_ordered.fasta > BFR_ref_clean_ordered_renamed.fasta
#where {nr} is the record number, starting from 1
#"." in regex means "any character except line break"
#"+" in regex means "occurring one or more times"
#so "replace --pattern '.+'" basically means "replace all characters in the contig name
After sorting by length:
grep ">" BFR_ref_clean_ordered.fasta | head
>ctg24_pilon_pilon
>ctg1_pilon_pilon
>ctg4_pilon_pilon
>ctg2_pilon_pilon
>ctg6_pilon_pilon
>ctg28_pilon_pilon
>ctg10_pilon_pilon
>ctg36_pilon_pilon
>ctg83_pilon_pilon
>ctg21_pilon_pilon
After renaming:
grep ">" BFR_ref_clean_ordered_renamed.fasta | head
>Contig_1
>Contig_2
>Contig_3
>Contig_4
>Contig_5
>Contig_6
>Contig_7
>Contig_8
>Contig_9
>Contig_10
grep ">" BFR_ref_clean_ordered_renamed.fasta | tail
>Contig_994
>Contig_995
>Contig_996
>Contig_997
>Contig_998
>Contig_999
>Contig_1000
>Contig_1001
>Contig_1002
>Contig_1003
The final genome assembly BFR_ref_ordered_renamed.fasta will be renamed to BFR_ref_final.fasta
cp BFR_ref_clean_ordered_renamed.fasta BFR_ref_final.fasta
mv ../final_assembly
Get final assembly stats (should be same as before, since all we did was reorder contigs and change names):
conda activate /hb/groups/bernardi_lab/programs/assembly_stats/
assembly-stats BFR_ref_final.fasta
stats for BFR_ref_final.fasta
sum = 595959787, n = 1003, ave = 594177.26, largest = 12326090
N50 = 2589815, n = 55
N60 = 1896895, n = 82
N70 = 1185759, n = 122
N80 = 758445, n = 184
N90 = 406612, n = 292
N100 = 1298, n = 1003
N_count = 0
Gaps = 0
Extract the lengths of each contig in the fasta file and sort from largest to smallest. Our file has just been sorted from largest to smallest so it should already show a decreasing contig length.
### Using samtools (v1.14) ###
module load samtools
#index fasta
samtools faidx BFR_ref_final.fasta
#select only first two fields
cut -f1-2 BFR_ref_final.fasta.fai > BFR_ref_final_lengths
### Using seqkit (v2.1.0) and including gc content ###
conda activate /hb/groups/bernardi_lab/programs/seqkit/
seqkit fx2tab --name --length --gc BFR_ref_final.fasta > BFR_ref_final_lengths_gc
Download both files
scp jatoy@hb.ucsc.edu:~/bfren_genome/final_assembly/BFR_ref_final_lengths* ./winhome/OneDrive/Documents/ucsc/projects/landscape_genomics/kelp_perch/reference_genome/
Plot contig lengths (from largest to smallest)
#load data and packages
setwd("C:/Users/jason/OneDrive/Documents/ucsc/projects/landscape_genomics/kelp_perch/reference_genome")
library(tidyverse)
bfr <- read_tsv("BFR_ref_final_lengths_gc.tsv", col_names = c("contig", "length", "gc")) %>%
mutate(id = row_number())
ggplot(data = bfr) +
geom_col(aes(x = id, y = length/1000000)) +
theme_bw() +
xlab("Contig") +
ylab("Length (Mb)")
Plot gc content by contig
ggplot(data = bfr) +
geom_col(aes(x = id, y = gc)) +
theme_bw() +
xlab("Contig") +
ylab("GC")
Plot gc content by contig length
ggplot(data = bfr) +
geom_point(aes(x = length/1000000, y = gc)) +
theme_bw() +
xlab("Contig length (Mb)") +
ylab("GC")
salloc --partition=128x24 --nodes=1 --time=96:00:00 --exclusive
ssh $SLURM_NODELIST
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/blast
#set path for blast databases
export PATH=$PATH:/hb/groups/bernardi_lab/programs/blast
export BLASTDB=/hb/groups/bernardi_lab/programs/blobtools2/nt
#unzip fasta
gzip -d BFR_ref_final.fasta.gz
#run blast
blastn -db nt -query /hb/home/jatoy/bfren_genome/final_assembly/BFR_ref_final.fasta -outfmt "6 qseqid staxids bitscore std" -max_target_seqs 10 -max_hsps 1 -evalue 1e-20 -out /hb/home/jatoy/bfren_genome/final_assembly/BFR_ref_final_blast.out -num_threads 18
# qseqid = Query Seq-id
# staxids = unique Subject Taxonomy ID(s), separated by a ';'(in numerical order)
# std = default format specifiers = 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore'
First, create an index for the final assembly (BFR_ref_final.fasta) using the bwa index command (v0.7.17-r1188):
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/bwa
bwa index ~/bfren_genome/final_assembly/BFR_ref_final.fasta
bwa index will create new files (.amb, .ann, .bwt, .pac, .sa) that will need to be within the directory to map sequences using bwa.
Now map Illumina short reads to assembly:
bwa mem -t 24 ~/bfren_genome/final_assembly/BFR_ref_final.fasta ~/bfren_genome/illumina_reads/BFR_ref_sr_1_trim_paired.fq.gz ~/bfren_genome/illumina_reads/BFR_ref_sr_2_trim_paired.fq.gz > ~/bfren_genome/final_assembly/BFR_ref_sr_bwa_aligned_final.sam
Convert the SAM file to BAM format using samtools (v1.10):
module load samtools
samtools view -Sb -@ 24 -O BAM -o BFR_ref_sr_bwa_aligned_final.bam BFR_ref_sr_bwa_aligned_final.sam
# S = input format is auto-detected
# b = output format BAM
# @ = number of threads
# O = specify output format
# o = output file name
Sort BAM file:
samtools sort -o BFR_ref_sr_bwa_aligned_sorted_final.bam -O BAM -@ 24 BFR_ref_sr_bwa_aligned_final.bam
Run BUSCO on final assembly (should be the same as the last BUSCO but we will run it again anyway:
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/busco_v5.2.2
busco -i /hb/home/jatoy/bfren_genome/final_assembly/BFR_ref_final.fasta -o busco_final -l actinopterygii_odb10 -m genome --cpu 24
Start an interactive session:
salloc --partition=128x24 --nodes=1 --time=48:00:00 --exclusive --cpus-per-task=8
ssh $SLURM_NODELIST
Now create your blobtools database and populate it with the coverage data, taxonomic information, and BUSCO scores:
#------- load programs --------------
module load miniconda3.9
conda activate /hb/groups/bernardi_lab/programs/blobtools2
#------- Run Commands ---------------
#Assign source directory variable to environment
SRC=/hb/home/jatoy/bfren_genome/
#Create a BlobDir
blobtools add --create --fasta $SRC/final_assembly/BFR_ref_final.fasta $SRC/blobtools2/BFR_ref_final_blob
#Add BLAST hits
blobtools add --threads 8 --hits $SRC/final_assembly/BFR_ref_final_blast.out --taxrule bestsumorder --taxdump /hb/groups/bernardi_lab/programs/blobtools2/taxdump $SRC/blobtools2/BFR_ref_final_blob
#Add mapping coverage
blobtools add --threads 8 --cov $SRC/final_assembly/BFR_ref_sr_bwa_aligned_sorted_final.bam $SRC/blobtools2/BFR_ref_final_blob
#Add BUSCO
blobtools add --threads 8 --busco $SRC/final_assembly/busco_final/run_actinopterygii_odb10/full_table.tsv $SRC/blobtools2/BFR_ref_final_blob
Download BFR_ref_pilon_2_blob folder to desktop and run blobtools view.
scp -r jatoy@hb.ucsc.edu:~/bfren_genome/blobtools2/BFR_ref_final_blob/ ./winhome/OneDrive/Documents/ucsc/projects/landscape_genomics/kelp_perch/reference_genome/blobtools
Create interactive html page with the blobtools results:
#activate conda env
conda activate blobtools2
#change to desired directory
cd ./winhome/OneDrive/Documents/ucsc/projects/landscape_genomics/kelp_perch/reference_genome/blobtools
#run view command
blobtools view --remote BFR_ref_final_blob
Save all figures and data tables.
Final assembly blob plot (post-cleaning):
Final assembly snailplot (post-cleaning):